{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Permutation Importance with Multicollinear or Correlated Features\n\n\nIn this example, we compute the permutation importance on the Wisconsin\nbreast cancer dataset using :func:`~sklearn.inspection.permutation_importance`.\nThe :class:`~sklearn.ensemble.RandomForestClassifier` can easily get about 97%\naccuracy on a test dataset. Because this dataset contains multicollinear\nfeatures, the permutation importance will show that none of the features are\nimportant. One approach to handling multicollinearity is by performing\nhierarchical clustering on the features' Spearman rank-order correlations,\npicking a threshold, and keeping a single feature from each cluster.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>See also\n    `sphx_glr_auto_examples_inspection_plot_permutation_importance.py`</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\nfrom collections import defaultdict\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom scipy.stats import spearmanr\nfrom scipy.cluster import hierarchy\n\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.inspection import permutation_importance\nfrom sklearn.model_selection import train_test_split"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Random Forest Feature Importance on Breast Cancer Data\n------------------------------------------------------\nFirst, we train a random forest on the breast cancer dataset and evaluate\nits accuracy on a test set:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "data = load_breast_cancer()\nX, y = data.data, data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\nclf = RandomForestClassifier(n_estimators=100, random_state=42)\nclf.fit(X_train, y_train)\nprint(\"Accuracy on test data: {:.2f}\".format(clf.score(X_test, y_test)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we plot the tree based feature importance and the permutation\nimportance. The permutation importance plot shows that permuting a feature\ndrops the accuracy by at most `0.012`, which would suggest that none of the\nfeatures are important. This is in contradiction with the high test accuracy\ncomputed above: some feature must be important. The permutation importance\nis calculated on the training set to show how much the model relies on each\nfeature during training.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "result = permutation_importance(clf, X_train, y_train, n_repeats=10,\n                                random_state=42)\nperm_sorted_idx = result.importances_mean.argsort()\n\ntree_importance_sorted_idx = np.argsort(clf.feature_importances_)\ntree_indices = np.arange(0, len(clf.feature_importances_)) + 0.5\n\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))\nax1.barh(tree_indices,\n         clf.feature_importances_[tree_importance_sorted_idx], height=0.7)\nax1.set_yticklabels(data.feature_names)\nax1.set_yticks(tree_indices)\nax1.set_ylim((0, len(clf.feature_importances_)))\nax2.boxplot(result.importances[perm_sorted_idx].T, vert=False,\n            labels=data.feature_names)\nfig.tight_layout()\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Handling Multicollinear Features\n--------------------------------\nWhen features are collinear, permutating one feature will have little\neffect on the models performance because it can get the same information\nfrom a correlated feature. One way to handle multicollinear features is by\nperforming hierarchical clustering on the Spearman rank-order correlations,\npicking a threshold, and keeping a single feature from each cluster. First,\nwe plot a heatmap of the correlated features:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))\ncorr = spearmanr(X).correlation\ncorr_linkage = hierarchy.ward(corr)\ndendro = hierarchy.dendrogram(corr_linkage, labels=data.feature_names, ax=ax1,\n                              leaf_rotation=90)\ndendro_idx = np.arange(0, len(dendro['ivl']))\n\nax2.imshow(corr[dendro['leaves'], :][:, dendro['leaves']])\nax2.set_xticks(dendro_idx)\nax2.set_yticks(dendro_idx)\nax2.set_xticklabels(dendro['ivl'], rotation='vertical')\nax2.set_yticklabels(dendro['ivl'])\nfig.tight_layout()\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we manually pick a threshold by visual inspection of the dendrogram\nto group our features into clusters and choose a feature from each cluster to\nkeep, select those features from our dataset, and train a new random forest.\nThe test accuracy of the new random forest did not change much compared to\nthe random forest trained on the complete dataset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "cluster_ids = hierarchy.fcluster(corr_linkage, 1, criterion='distance')\ncluster_id_to_feature_ids = defaultdict(list)\nfor idx, cluster_id in enumerate(cluster_ids):\n    cluster_id_to_feature_ids[cluster_id].append(idx)\nselected_features = [v[0] for v in cluster_id_to_feature_ids.values()]\n\nX_train_sel = X_train[:, selected_features]\nX_test_sel = X_test[:, selected_features]\n\nclf_sel = RandomForestClassifier(n_estimators=100, random_state=42)\nclf_sel.fit(X_train_sel, y_train)\nprint(\"Accuracy on test data with features removed: {:.2f}\".format(\n      clf_sel.score(X_test_sel, y_test)))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}